knitr::opts_chunk$set(echo=TRUE, warning=FALSE,
message=FALSE, eval=TRUE, cache=TRUE,
fig.width=20, fig.height=20)
library(ncdf4)
## Helper function: smooth a vector x by averaging it over n evenly spaced
## windows of (roughly) n consecutive entries each, returning a length-n vector.
windowsmooth <- function(x, n){
  ## Example inputs: x = rnorm(100); n = 10
  ii = seq.int(from = 0, to = length(x) - n, length.out = n)
  jj = ii + n
  inds = Map(function(a, b) (a + 1):b, ii, jj)
  sapply(inds, function(ind) mean(x[ind]))
}
(Heavily borrowed from Paul's Bayesian workshop slides on matrix population modeling.)
In cell-level (cytometry) data collected over time from the ocean, one interesting derived dataset is the empirical distribution of biomass (or counts) over time. Using scientific intuition about the cells' interaction with the environment, we can model the evolution of this empirical biomass distribution over time. This provides valuable empirical evidence for our theoretical understanding of cells' life cycles in the ocean.
We describe a matrix population model.
First, for simplicity, assume the empirical biomass distribution is discrete and has \(m\) size classes:
The \(i\)’th size class has a diameter between \(v_i\) and \(v_{i+1}\), which are defined as:
\[ v_{i}=v_{\min } 2^{(i-1) \Delta_{v}} \text { for } i=1, \ldots, m+1.\]
The empirical distribution at time \(t\) is:
\[w(t) \in \mathbb{R}^{m},\]
whose \(i\)’th entry (\(i=1,\cdots,m\)) is the normalized count (or amount of biomass) in the \(i\)’th size class; normalization is done so that \(\sum_{i=1}^{m} w_{i}(t)=1\).
The main mechanism for evolution over time is through the time propagation matrix:
\[A\left(t, \Delta_{t}\right) \in \mathbb{R}^{m \times m} \quad\]
defined as the transition of the distribution from time \(t\) to a later time \(t+\Delta_t\). This matrix directly encodes three basic types of cell activity.
Growth: The blue entries, which are the subdiagonal entries, represent a transition to the next larger size class.
Division: The orange entries, \(j\) cells away from the diagonal, represent cell division, so that biomass/counts transition from a given size class to the size class \(j\) classes smaller, on the diagonal, representing an exact halving of the cell size. (The value of \(j\) is determined by the diameters of the size classes.)
Stasis: The diagonal entries. Cells that neither grow nor divide remain in their size class.
Using this evolution mechanism, the size distribution at the next time is defined as:
\[ w\left(t+\Delta_{t}\right)=\frac{A\left(t, \Delta_{t}\right) w(t)}{\left\|A\left(t, \Delta_{t}\right) w(t)\right\|_{1}},\]
which is the normalized size distribution after evolving by left multiplication by \(A(t, \Delta_t)\).
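As a concrete illustration, here is a minimal R sketch of a toy propagation matrix with stasis on the diagonal, growth on the subdiagonal, and division \(j\) classes off the diagonal, acting on a size distribution. All numbers below are made up for illustration, and the toy handles the edge size classes and the doubling of counts upon division only loosely; it is not the fitted model.
## Toy example: build a propagation matrix A and evolve w by one time step.
m <- 6              ## number of size classes (made-up)
j <- 2              ## off-diagonal offset for division (halving of cell size)
gamma <- 0.3        ## made-up growth fraction
delta <- 0.2        ## made-up division fraction
A <- diag(1 - gamma - delta, m)               ## stasis on the diagonal
A[cbind(2:m, 1:(m - 1))] <- gamma             ## growth: to the next larger class
A[cbind(1:(m - j), (j + 1):m)] <- delta       ## division: to the class j sizes smaller
## (a real model would also account for counts doubling upon division)
w <- rep(1 / m, m)                            ## uniform starting distribution
w_next <- as.numeric(A %*% w)
w_next <- w_next / sum(abs(w_next))           ## renormalize (L1 norm)
w_next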
Growth is driven directly by sunlight:
\[\gamma\left(t, \Delta_{t}\right)=\Delta_{t} \gamma_{\max }\left(1-\exp \left(-\frac{E(t)}{E^{*}}\right)\right)\]
TODO: be more detailed about this e.g. define the variables, explain this effect in words.
TODO: will growth be size dependent? Ask the group’s opinion.
Possibly size-dependent, parametric or non-parametric. Splines are one way to go.
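A direct transcription of the growth-fraction formula above, as a minimal R sketch. Interpreting \(E(t)\) as incident light (PAR) and \(E^*\) as a saturation parameter is an assumption pending the TODO above; the function name and all numeric values are placeholders, and this sketch is not size-dependent.
## Growth fraction as a function of light, transcribing the formula above.
## gamma_max = maximum growth fraction, E = light (e.g. PAR), E_star = saturation
## parameter; all values are placeholders, not fitted quantities.
growth_fraction <- function(dt, E, gamma_max, E_star){
  dt * gamma_max * (1 - exp(-E / E_star))
}
## Example: growth fraction over a made-up daily light curve.
E <- pmax(0, sin(seq(0, 2 * pi, length.out = 24)))  ## toy light curve
growth_fraction(dt = 1, E = E, gamma_max = 0.5, E_star = 0.3)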
One major missing component (in the model described thus far) could be cell respiration, another major known cell activity that directly affects the cell size distribution. There were several ideas discussed for how to incorporate it.
Basically, this is two days' worth of 2-hourly size distribution data collected in a controlled lab environment.
TODO: describe more rigorously here.
##' Helper to open Zinser data.
##' @param filename NC file filename, like
##' \code{"data/Zinser_SizeDist_calibrated-52-12.nc"}.
##' @return List containing \code{obs} (T x m matrix), \code{time} (length-T
##' vector of minutes passed since the beginning), \code{num_sizeclass}
##' (number of size classes, m), \code{v} (left-hand-side borders of the
##' size classes), \code{PAR} (sunlight), and the open \code{nc} handle.
open_zinser <- function(filename = "Zinser_SizeDist_calibrated-52-12.nc",
                        datadir = "data"){
  ## Read data, e.g. from "data/Zinser_SizeDist_calibrated-52-12.nc".
  nc <- nc_open(file.path(datadir, filename))
  PAR <- ncvar_get(nc, 'PAR')      ## Sunlight
  w_obs <- ncvar_get(nc, 'w_obs')  ## Size distribution data
  ## List the variables in the nc file (for reference).
  names(nc$var)
  ## Number of size classes.
  m <- ncvar_get(nc, 'm')
  ## Time (in number of minutes since the beginning).
  time <- ncvar_get(nc, 'time')
  ## Size classes: v_i = v_min * 2^((i - 1) * delta_v).
  delta_v_inv <- ncvar_get(nc, 'delta_v_inv')
  v_min <- ncvar_get(nc, 'v_min')
  delta_v <- 1 / delta_v_inv
  v <- v_min * 2^(((1:m) - 1) * delta_v)
  ## Name the rows (times) and columns (size classes) of w_obs.
  rownames(w_obs) = time
  colnames(w_obs) = v
  ## Return.
  return(list(obs = w_obs,
              time = time,
              num_sizeclass = m,
              v = v,
              nc = nc,
              PAR = PAR))
}
We can see that growth and division are in concordance with the sunlight variable PAR over the two-day period of the data.
The bottom plot is another version of the data with a different number of size classes: there are half as many size categories (26 of them), and the same number of time points (\(25\)).
## Retrieve and plot the data, at both size-class resolutions.
for(filename in c("Zinser_SizeDist_calibrated-52-12.nc",
                  "Zinser_SizeDist_calibrated-26-6.nc")){

  ## Load data.
  dat = open_zinser(filename)

  ## Plot *calibrated* Zinser data.
  ncat = ncol(dat$obs)  ## alternatively, ncat = dat$num_sizeclass
  cols = colorRamps::blue2red(ncat)
  ylim = range(dat$obs)
  matplot(dat$obs, col = cols, lwd = seq(from = 0.1, to = 5, length = ncat),
          lty = 1, type = 'l', ylab = "", xlab = "")
  title(main = paste0("Zinser data, ", ncat, " size classes"))

  ## Mark every d'th time point as held-out test data (i_test).
  d = 3
  i_test = rep(0, length(dat$time))
  i_test[seq(from = d, to = length(dat$time), by = d)] = 1

  ## Add grey markers for the time points that we're going to leave out.
  cex = 1
  abline(v = which(i_test == 1), col = 'grey', lwd = 3, lty = 1)
  for(ii in which(i_test == 1)){
    points(x = rep(ii, dat$num_sizeclass), y = dat$obs[ii, ], pch = 16, col = "grey", cex = cex)
    points(x = rep(ii, dat$num_sizeclass), y = dat$obs[ii, ], pch = 16, col = "white", cex = cex * 0.7)
    points(x = rep(ii, dat$num_sizeclass), y = dat$obs[ii, ], pch = 1, col = "grey", cex = cex)
  }

  ## Add sunlight (rescaled to the plot's range).
  scpar = scale(dat$PAR)
  scpar = scpar - min(scpar)
  scpar = scpar / max(scpar) * max(dat$obs)
  lines(scpar, lwd = 5)

  ## Add legend.
  legend("topright", lwd = 3, col = 'grey', pch = 1, legend = "Test data")
}
This dataset was collected from the ocean, following a parcel of water. It is much more fine-grained in time (TODO: maybe we can utilize this?).
TODO: describe more rigorously here.
We can visualize the SeaFlow data from the file data/SeaFlow_SizeDist_regrid-15-5.nc in the GitHub repository https://github.com/fribalet/Bayesian-matrixmodel.
## Setup: load and preprocess the SeaFlow data.
ndays <- 4.0
limit_to_numdays <- 4.0
stride_t_obs <- 10
data <- list()
data$dt <- 10
source("data_processing.r")

## Visualize the data.
ncat = nrow(data$obs)
cols = colorRamps::blue2red(ncat)
ylim = range(data$obs)

## Make two plots: smaller size classes, then larger size classes.
par(mfrow = c(2, 1))
par_smooth = windowsmooth(data$PAR, ncol(data$obs))  ## smoothed sunlight (PAR)

## Smaller sizes.
iis = 1:7
matplot(t(data$obs), type = 'l', lwd = 2, lty = 1, ylim = ylim, col = 'grey80',
        ylab = "")
lines(par_smooth / max(par_smooth) * 0.2, lwd = 3, lty = 3)
cutoff = quantile(par_smooth / max(par_smooth), 0.2)
lines(pmin(par_smooth / max(par_smooth) * 0.2, cutoff), lwd = 3, lty = 1)
matlines(t(data$obs)[, iis], type = 'l', lwd = 2, col = cols[iis], lty = 1, ylim = ylim)
title(main = "Smaller sizes")

## Larger sizes.
iis = 8:15
matplot(t(data$obs), type = 'l', lwd = 2, col = 'grey80', lty = 1, ylim = ylim,
        main = "Larger sizes", ylab = "")
lines(par_smooth / max(par_smooth) * 0.2, lwd = 3, lty = 3)
cutoff = quantile(par_smooth / max(par_smooth), 0.2)
lines(pmin(par_smooth / max(par_smooth) * 0.2, cutoff), lwd = 5, lty = 1)
matlines(t(data$obs)[, iis], type = 'l', lwd = 2, col = cols[iis], lty = 1, ylim = ylim)
We can see that:
- the small size distributions negatively correlate with the large size distributions,
- their diel cycles are clearly visible,
- the small size distributions increase overnight (because division is prominent), and
- the large size distributions increase with a slight lag to the sunlight cycle.
At a minimum, we should set up the framework of:
We should look into whether we can model the \(w(t)\)'s directly as compositions. I don't think this will be straightforward, or easy to write simply as a likelihood model.
The Matrix Population Model (MPM) takes as input a time series of phytoplankton cell sizes. "Size" here is in units of \(\mu m^{3}\), and is estimated from optical measurements. Though size is a continuous measurement, we discretize it into \(m \in \mathbb{N}\) size bins, where \(m\) is a tuning parameter determined by the analyst. We have observations at \(T\) time points, with consecutive time points typically one hour apart. Thus, the data is the matrix \(X \in \mathbb{R}^{m \times T}\). The observation at time \(t\) can be denoted
\[X_{t}=\left(X_{1, t}, X_{2, t}, \ldots, X_{m, t}\right),\]
where each \(X_{i, t} \in \mathbb{Z}\) is the number of phytoplankton cells in size class \(i\) at time \(t\). A natural approach to modeling this type of data is via a multinomial distribution:
\[ X_{t} \sim \text{Multinomial}\left(\theta_{t}\right), \]
where \(\theta_{t}=\left(\theta_{1, t}, \theta_{2, t}, \ldots, \theta_{m, t}\right) \in \Delta^{m}\), and \(\Delta^{m}\) is the \(m\)-dimensional probability simplex. However, this model assumes that we have observed the data directly, when in reality we have estimated the size distribution from optical measurements. Thus, we likely have overdispersion, meaning the true underlying variance of our data is greater than what this simple model would suggest. One solution to this problem is to use a hierarchical model, where we treat the parameters \(\theta_{t}\) as themselves being random. This serves to increase the marginal variance of the \(X_{t}\)'s, thus more closely matching the increased variance in the data. This model could take the following form:
\[ \begin{aligned} X_{t} \mid \theta_{t}, \sigma & \sim \text{Multinomial}\left(\theta_{t}\right) \\ \theta_{t} \mid \sigma & \sim \text{Dirichlet}\left(\alpha_{t}\right) \\ \sigma & \sim \operatorname{Exp}(\lambda), \end{aligned} \]
where \(\alpha_{t}=\sigma \cdot \bar{\alpha}_{t}\), \(\lambda \in \mathbb{R}\), and \(\bar{\alpha}_{t} \in \Delta^{m}\). Here, \(\sigma\) is a dispersion parameter. The vectors \(\bar{\alpha}_{t}\) are estimated as follows:
1. Input: estimated projection matrices \(\{\boldsymbol{A}(t)\}_{t=0}^{T-1}\) and initial size distribution \(\boldsymbol{w}(0)=\frac{X_{1}}{\left\|X_{1}\right\|_{1}}\).
2. Obtain \(\widehat{\boldsymbol{w}}(t)=\prod_{k=1}^{t-1} \boldsymbol{A}(k) \, \boldsymbol{w}(0)\).
3. Set \(\bar{\alpha}_{t}=\frac{\widehat{\boldsymbol{w}}(t)}{\left\|\widehat{\boldsymbol{w}}(t)\right\|_{1}}\).
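A minimal R sketch of this recipe follows. The function name make_alpha_bar, the list of projection matrices A_list, and the initial counts X1 are hypothetical placeholders; in practice these come from the fitted model and the data.
## Sketch: compute the Dirichlet means alpha_bar_t by propagating w(0) forward
## through estimated projection matrices, then scale by the dispersion sigma.
make_alpha_bar <- function(A_list, X1, sigma){
  w0 <- X1 / sum(abs(X1))                        ## initial size distribution
  w_hat <- vector("list", length(A_list) + 1)
  w_hat[[1]] <- w0
  for(t in seq_along(A_list)){
    w_next <- as.numeric(A_list[[t]] %*% w_hat[[t]])
    w_hat[[t + 1]] <- w_next / sum(abs(w_next))  ## renormalize to the simplex
  }
  alpha_bar <- do.call(rbind, w_hat)             ## (T x m) matrix of Dirichlet means
  sigma * alpha_bar                              ## alpha_t = sigma * alpha_bar_t
}
## Toy usage with made-up matrices and counts:
m <- 4
A_list <- replicate(5, diag(m) + matrix(runif(m * m, 0, 0.05), m, m), simplify = FALSE)
alpha <- make_alpha_bar(A_list, X1 = c(10, 20, 30, 40), sigma = 50)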
In the meantime, we need a way to link the observed size proportions \(w(t)\) to our estimated \(\hat w(t)\) (whose posterior samples we observe). One way is to perform a log-ratio transform with respect to one category. Let's say that one category is the last size class. Then the log-ratio transform is
\[ l_t[i] = \log\left(\frac{w(t)[i]}{w(t)[n]}\right), \quad i=1,\cdots, n-1.\]
This is a mapping from \(\mathbb{R}^n\) to \(\mathbb{R}^{n-1}\), where the mapped values have no geometric constraint.
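For example, here is a minimal sketch of this (additive) log-ratio transform and its inverse in R; the helper names alr and alr_inverse are made up for illustration, not taken from an existing package.
## Additive log-ratio transform with respect to the last category, and its inverse.
alr <- function(w) log(w[-length(w)] / w[length(w)])
alr_inverse <- function(l){
  w <- c(exp(l), 1)
  w / sum(w)
}
## Example: a 4-part composition maps to 3 unconstrained values and back.
w <- c(0.1, 0.2, 0.3, 0.4)
l <- alr(w)        ## length 3, no sum-to-one constraint
alr_inverse(l)     ## recovers the original composition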
There is no obvious utility in terms of restoring normality, symmetry, or bell-shapedness; in fact, there is not much evidence that the resulting posterior distributions of the individual \(w(t)_i\)'s are pathologically non-bell-shaped (are they?). The log-ratio transform does seem to help with this, though. In the artificial simulation below of \(T=1000\) Poisson-distributed 3-variate counts transformed to compositions, we can see that while the composition distribution is not pathological, the transformed version seems closer to normal, except in the tails.
## Generate a bunch of 3-length probabilities from Poisson counts. This is a
## weird, potentially misleading example.
set.seed(0)
lambda = 100
n = 3
N = 1000
pmat = t(sapply(1:N, function(ii){
  X = rpois(n, lambda)
  p = X / sum(X)
}))

## View the simplex of compositions from nine angles.
par(mfrow = c(3, 3))
for(phi in seq(from = 10, to = 180, length = 9)){
  par(mar = c(1, 1, 1, 1))
  plot3D::scatter3D(pmat[, 1], pmat[, 2], pmat[, 3],
                    col = rgb(0, 0, 1, 0.2),
                    pch = 16,
                    bty = 'g', phi = phi,
                    ticktype = "detailed")  ## Using detailed ticks
}
## Log-ratio transform each composition with respect to its geometric mean
## (the first two coordinates of a centered log-ratio transform).
pmat_transformed = t(apply(pmat, 1, function(myrow){
  gm = prod(myrow)^(1/3)
  ## Alternative (additive log-ratio): log(myrow[1:2]/myrow[3])
  log(myrow[1:2] / gm)
}))
## Before transformation
library(ggplot2)
df = data.frame(x=pmat[,2], y = pmat[,3])
p <- ggplot2::ggplot(df, aes(x, y)) + geom_point() + theme_classic() + ggtitle("Before transformation")
ggExtra::ggMarginal(p, type = "histogram")
MVN::mvn(pmat, multivariatePlot= "qq")
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness -4.93710170227701 1 YES
## 2 Mardia Kurtosis 3.83167217173925 0.000127275259194981 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk Column1 0.9990 0.8654 YES
## 2 Shapiro-Wilk Column2 0.9980 0.2671 YES
## 3 Shapiro-Wilk Column3 0.9978 0.2170 YES
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th
## 1 1000 0.3337417 0.02783161 0.3333333 0.2474916 0.4237288 0.3147541
## 2 1000 0.3331800 0.02710450 0.3333333 0.2344322 0.4186851 0.3140579
## 3 1000 0.3330783 0.02759793 0.3333333 0.2348993 0.4141414 0.3152785
## 75th Skew Kurtosis
## 1 0.3531250 0.01464774 -0.118524973
## 2 0.3505155 0.11651038 -0.005163532
## 3 0.3519087 -0.04416553 0.243316731
title(main="2d Normal QQ plot")
## After transformation
df = data.frame(x=pmat_transformed[,1], y = pmat_transformed[,2])
p <- ggplot2::ggplot(df, aes(x, y)) + geom_point() + theme_classic()+ ggtitle("After log ratio transformation")
ggExtra::ggMarginal(p, type = "histogram")
MVN::mvn(pmat_transformed, multivariatePlot= "qq")
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 10.0161594436243 0.040156356265244 NO
## 2 Mardia Kurtosis 1.3746429184973 0.169242175907264 YES
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk Column1 0.9985 0.5854 YES
## 2 Shapiro-Wilk Column2 0.9987 0.6657 YES
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## 1 1000 0.0011523451 0.08401465 3.136535e-03 -0.2781441 0.2587734
## 2 1000 -0.0003436196 0.08176989 8.483996e-05 -0.3264952 0.2440725
## 25th 75th Skew Kurtosis
## 1 -0.05538410 0.06038627 -0.099377782 -0.05892365
## 2 -0.05742843 0.05242728 -0.006865054 0.04000420
title(main="2d Normal QQ plot")
Rather, the problem is that proportions have a constant sum of \(1\), so the Euclidean distance between two proportion vectors in \(\mathbb{R}^n\) can be a misleading (distorted) quantity. (For instance, when \(n=2\), all measurements lie on a line, so the full Euclidean error is not sensible; instead, one might just as well omit the second category and measure the estimation error in the first category alone; the log-ratio transform achieves this effect.)
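As a tiny illustration of the \(n=2\) case (the numbers below are made up): the two coordinates of a proportion vector carry the same single piece of information, so the full Euclidean error simply double-counts the error in the first category.
## With n = 2, proportions (p, 1 - p) lie on a line; the Euclidean distance between
## two such vectors is just sqrt(2) times the error in the first category.
p1 <- c(0.6, 0.4)
p2 <- c(0.5, 0.5)
sqrt(sum((p1 - p2)^2))   ## 0.1414... = sqrt(2) * 0.1
abs(p1[1] - p2[1])       ## 0.1, the error in the first category alone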
Note that if we also observe the total number of particles \(C_t\), so that we know the total counts \(W(t) = C_t\, w(t)\), then we might be able to use a multinomial model directly.
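For instance, a minimal sketch of what such a multinomial likelihood could look like in R; the counts and the model-predicted composition below are made-up placeholders, not fitted quantities.
## Sketch: multinomial log-likelihood of observed counts W(t) = C_t * w(t),
## given a model-predicted composition w_hat(t).
w_hat <- c(0.5, 0.3, 0.2)   ## model-predicted size distribution (placeholder)
W_t <- c(48, 33, 19)        ## observed counts per size class (C_t = 100, placeholder)
dmultinom(W_t, prob = w_hat, log = TRUE)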
Some other leads:
- https://link.springer.com/article/10.1007/s10260-020-00512-y
- The arcsine transformation: https://en.wikipedia.org/wiki/Cohen%27s_h
- Aitchison's short book on compositions:
  - http://www.leg.ufpr.br/lib/exe/fetch.php/pessoais:abtmartins:a_concise_guide_to_compositional_data_analysis.pdf
  - http://www.sediment.uni-goettingen.de/staff/tolosana/extra/CoDa.pdf

Currently, the data model (i.e. likelihood model) is based on the sum of the absolute errors across size classes:
# From "stan/matrixmodel_sigmoidaldelta.stan"
...
for(it in 1:nt_obs){
if(i_test[it] == 1){
diff = 0.0;
for(iv in 1:m){
diff += fabs(mod_obspos[iv,it] - obs[iv,it]);
}
diff = diff/sigma;
log_like_test += normal_lpdf(diff | 0.0, 1.0);
}
}
log_like_test = log_like_test/n_test;
This is saying that the data likelihood treats the SUM of the absolute estimation errors across ALL categories, scaled by \(\sigma\), as a truncated standard normal:
\[ \frac{1}{\sigma}\sum_{i=1}^m \left| \hat w(t)[i] - w(t)[i] \right| \sim \mathrm{truncNorm}(0, 1).\]
While this achieves the basic effect of keeping the overall error "small", it allows for a large error in some individual categories, because only the sum of the absolute errors across categories is penalized.
Why is it bad for us to use the same \(\sigma\) across categories? In other words, do we want different "standards" of estimation accuracy across categories?